Sparse Models of Natural Language Text

Author

  • Dani Yogatama
Abstract

In statistical text analysis, many learning problems can be formulated as the minimization of a sum of a loss function and a regularization function for a vector of parameters (feature coefficients). The loss function drives the model to learn generalizable patterns from the training data, whereas the regularizer plays two important roles: to prevent the model from capturing idiosyncrasies of the training data (overfitting) and to encode prior knowledge about the model parameters. When learning from high-dimensional data such as text, it has been empirically observed that relatively few dimensions are relevant to the predictive task (Forman, 2003). How can we capitalize on this insight and choose which dimensions are relevant in an informed and principled manner? Sparse regularizers provide a way to select relevant dimensions by means of regularization. However, past work has rarely used the regularization function to encode non-trivial prior knowledge that yields sparse solutions. This thesis investigates the application of sparse models, especially structured sparse models, as a medium for encoding linguistically motivated prior knowledge in textual models, in order to advance NLP systems. We explore applications of sparse NLP models in text categorization, word embeddings, and temporal models of text. Sparse models come with their own challenges, since new instantiations of sparse models often require specialized optimization methods. This thesis therefore also presents optimization methods for the proposed instantiations of sparse models. The goals of this thesis are thus twofold: (i) to show how sparsity can be used to encode linguistic information in statistical text models, and (ii) to develop efficient learning algorithms to solve the resulting optimization problems.
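To make the loss-plus-regularizer formulation concrete, the following is a minimal sketch (not taken from the thesis) of its simplest sparse instantiation, the lasso: a squared loss plus an L1 penalty, minimized by proximal gradient descent. The soft-thresholding step is what drives irrelevant coefficients exactly to zero; the thesis itself develops structured penalties beyond plain L1, and all function names and data below are hypothetical.

    import numpy as np

    def soft_threshold(v, t):
        # Proximal operator of the L1 norm: shrink each coordinate toward zero.
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_proximal_gradient(X, y, lam=0.1, iters=500):
        # Minimize (1/2n) * ||Xw - y||^2 + lam * ||w||_1.
        n, d = X.shape
        step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
        w = np.zeros(d)
        for _ in range(iters):
            grad = X.T @ (X @ w - y) / n                      # gradient of the smooth loss
            w = soft_threshold(w - step * grad, step * lam)   # proximal (shrinkage) step
        return w

    # Toy data: 50 dimensions, only 3 of them relevant.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))
    w_true = np.zeros(50)
    w_true[:3] = [2.0, -1.5, 1.0]
    y = X @ w_true + 0.1 * rng.normal(size=100)
    w_hat = lasso_proximal_gradient(X, y, lam=0.1)
    print(np.count_nonzero(np.abs(w_hat) > 1e-8), "nonzero coefficients")

With a large enough penalty weight, most coefficients become exactly zero, which is the sense in which a sparse regularizer "selects" relevant dimensions.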


Similar Resources

Research Statement — Dani Yogatama

I design algorithms for intelligent processing of natural language texts—for example, to extract factual information into a structured database (e.g., extracting headquarters locations, CEOs, and phone numbers of companies from text into a database) or to predict real-world events from text (e.g., scientific trends, disease outbreaks). These applications require models of text that scale to lar...


Search Techniques for Learning Probabilistic Models of Word Sense Disambiguation

The development of automatic natural language understanding systems remains an elusive goal. Given the highly ambiguous nature of the syntax and semantics of natural language, it is not possible to develop rule-based approaches to understanding even very limited domains of text. The difficulty in specifying a complete set of rules and their exceptions has led to the rise of probabilistic approa...


Text-Driven Forecasting

Forecasting the future hinges on understanding the present. The web—particularly the social web—now gives us an up-to-the-minute snapshot of the world as it is and as it is perceived by many people, right now, but that snapshot is distributed in a way that is incomprehensible to a human. Much of this data is encoded in text, which is noisy, unstructured, and sparse; yet recent developments in n...


Large-Scale Bayesian Logistic Regression for Text Categorization

Logistic regression analysis of high-dimensional data, such as natural language text, poses computational and statistical challenges. Maximum likelihood estimation often fails in these applications. We present a simple Bayesian logistic regression approach that uses a Laplace prior to avoid overfitting and produces sparse predictive models for text data. We apply this approach to a range of doc...
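The key equivalence in this abstract is that MAP estimation under a Laplace (double-exponential) prior is the same optimization problem as L1-penalized logistic regression, which is what yields exact zeros in the coefficient vector. Below is a minimal sketch of that idea, using scikit-learn as an assumed stand-in for the paper's own solver, with toy documents and labels:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    docs = ["cheap meds online now", "meeting rescheduled to noon",
            "win cash prizes now", "notes from today's meeting"]
    labels = [1, 0, 1, 0]  # toy spam/ham labels

    X = TfidfVectorizer().fit_transform(docs)  # sparse document-term matrix
    # MAP under a Laplace prior == L1-penalized logistic regression;
    # the liblinear solver supports the L1 penalty. Smaller C => stronger prior.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X, labels)
    print((clf.coef_ != 0).sum(), "of", clf.coef_.size, "features have nonzero weight")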


Subspace Representations of Unstructured Text

Since 1970 vector-space models have been used for information retrieval from unstructured text. The initial simple vector-space models suffered the same problems encountered today in searching the internet. These difficulties were significantly relieved by Latent Semantic Indexing (LSI), introduced in 1990 and improved through 1995. Starting with the simple vector-space model’s sparse term-by-d...
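As a rough illustration of the LSI step described here, a rank-k truncated SVD of the sparse count matrix yields dense low-dimensional document representations. This sketch uses scikit-learn on toy documents (with documents as rows, the transpose of the abstract's term-by-document layout); none of it comes from the paper itself:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["human machine interface", "user interface system",
            "graph of trees", "graph minors and trees"]
    X = CountVectorizer().fit_transform(docs)  # sparse counts, documents as rows
    # LSI: project onto the top-k singular directions of the count matrix.
    lsi = TruncatedSVD(n_components=2, random_state=0)
    doc_vectors = lsi.fit_transform(X)  # dense rank-2 document representations
    print(doc_vectors.shape)  # (4, 2)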




Publication date: 2015